Establishing a New State-of-the-Art for French Named Entity Recognition
The French TreeBank developed at the University Paris 7 is the main source of
morphosyntactic and syntactic annotations for French. However, it does not
include explicit information related to named entities, which are among the
most useful types of information for many natural language processing tasks
and applications. Moreover, no large-scale French corpus with named entity
annotations contains referential information, which complements the type and
the span of each mention with an indication of the entity it refers to. We
have manually annotated the French TreeBank with such information, after an
automatic pre-annotation step. We sketch the underlying annotation guidelines
and we provide a few figures about the resulting annotations.
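The referential annotation described above pairs each mention's type and span with an identifier for the entity it denotes. A minimal sketch of such a record (the field names and identifiers are illustrative; they do not reproduce the actual French TreeBank annotation schema):

```python
from dataclasses import dataclass

@dataclass
class EntityMention:
    """A named-entity mention: type, character span, and referent.

    Field names are illustrative, not the French TreeBank schema.
    """
    label: str      # entity type, e.g. "PER", "LOC", "ORG"
    start: int      # character offset of the mention start
    end: int        # character offset just past the mention end
    referent: str   # identifier of the entity being referred to

text = "Victor Hugo est né à Besançon."
mentions = [
    EntityMention("PER", 0, 11, "ent-1"),   # "Victor Hugo"
    EntityMention("LOC", 21, 29, "ent-2"),  # "Besançon"
]

# Mentions sharing a referent id denote the same real-world entity.
surface = [text[m.start:m.end] for m in mentions]
```

Two mentions with the same `referent` value corefer, which is exactly the information that type and span alone cannot express.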
JANENG STARCH–CHITOSAN COMBINED WITH THE NATURAL PRESERVATIVES TURMERIC AND ASCORBIC ACID AS AN EDIBLE COATING
Edible coatings are one current solution for food preservation. In this study, an edible coating was prepared from a composite edible film of janeng starch, chitosan and natural preservatives (turmeric and ascorbic acid), and applied to meatballs (bakso) and cheese. The best composite edible film compositions, starch : chitosan : turmeric (1.2% : 0.4% : 0.375%) and starch : chitosan : ascorbic acid (1.2% : 0.4% : 0.5%), selected on the basis of tensile strength, elongation and colour tests, were used for the coating application. Antimicrobial tests showed that edible films combined with turmeric or ascorbic acid inhibited the growth of E. coli with a larger inhibition-zone diameter than the plain janeng starch–chitosan film, namely 7 mm each. Coating the cheese samples reduced microbial growth, inhibited fat oxidation by up to 50% and limited the increase in moisture content to 41.17% over 3 months of storage, compared with uncoated cheese, while coating the meatball samples limited the moisture increase to 20.76%. Sensory analysis of the coated meatballs in terms of aroma, texture and colour suggested that coated meatballs were better than uncoated ones after 3 days of storage. Keywords: edible coating, janeng starch, chitosan, turmeric, ascorbic acid, antimicrobial
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We use the multilingual OSCAR corpus, extracted from Common Crawl via
language classification, filtering and cleaning, to train monolingual
contextualized word embeddings (ELMo) for five mid-resource languages. We then
compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for
these languages on the part-of-speech tagging and parsing tasks. We show that,
despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on
OSCAR perform much better than monolingual embeddings trained on Wikipedia.
They actually equal or improve the current state of the art in tagging and
parsing for all five languages. In particular, they also improve over
multilingual Wikipedia-based contextual embeddings (multilingual BERT), which
almost always constitutes the previous state of the art, thereby showing that
the benefit of a larger, more diverse corpus surpasses the cross-lingual
benefit of multilingual embedding architectures.
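The OSCAR pipeline described above extracts monolingual text from Common Crawl through language classification, filtering and cleaning. A toy sketch of that shape of pipeline is given below; the heuristics (a stub language scorer, a minimum-length filter, exact deduplication by hashing) are simplified stand-ins, not the real pipeline, which uses a fastText language classifier:

```python
import hashlib

def looks_french(line: str) -> bool:
    """Toy stand-in for a language classifier. The real OSCAR
    pipeline uses fastText language identification instead."""
    french_markers = (" le ", " la ", " les ", " est ", " une ", " des ")
    padded = f" {line.lower()} "
    return any(marker in padded for marker in french_markers)

def clean_corpus(lines):
    """Keep lines passing the language check and a minimum-length
    filter, and drop exact duplicates via content hashing."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if len(line) < 10 or not looks_french(line):
            continue
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(line)
    return kept

raw = [
    "Paris est la capitale de la France.",
    "Paris est la capitale de la France.",  # exact duplicate: dropped
    "Hello world, this is English text.",   # fails the language check
    "ok",                                   # too short: dropped
]
cleaned = clean_corpus(raw)
```

Even this crude version shows why the resulting corpus stays noisy: heuristic filters pass borderline lines, which is the noise the abstract says OSCAR-trained embeddings tolerate well.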
A dataset for the automatic detection of place names in modern French texts (Un jeu de données pour la détection automatique de lieux dans les textes français modernes)
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on named entity processing in old newspapers. The challenge proposed various tasks in three languages; among them, we focused on named entity recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types, and that sentence segmentation has an important impact on the results.
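One common way to combine several word representations, as the system above does with FastText and ELMo vectors, is to concatenate the per-token embeddings before feeding them to the tagger. A minimal sketch follows; the dimensions and the concatenation strategy are assumptions for illustration, not the team's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token embeddings for a 6-token sentence:
# a 300-d static FastText vector and a 1024-d contextual ELMo
# vector per token (typical published sizes for these models).
fasttext_vecs = rng.normal(size=(6, 300))
elmo_vecs = rng.normal(size=(6, 1024))

# Concatenating along the feature axis yields one 1324-d vector
# per token; a downstream NER tagger consumes this joint input.
combined = np.concatenate([fasttext_vecs, elmo_vecs], axis=1)
```

The tagger then sees both the static distributional signal and the contextual signal at once, which is one plausible reading of why combining representations helps across all NE types.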
Establishing a New State-of-the-Art for French Named Entity Recognition
Due to the COVID-19 pandemic, the 12th edition of LREC was cancelled; the LREC 2020 proceedings are available at http://www.lrec-conf.org/proceedings/lrec2020/index.html
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
This paper describes and compares the impact of different types and sizes of training corpora on language models such as ELMo. Asking the fundamental question of quality versus quantity, we evaluate four French training corpora on downstream parsing, POS tagging and named entity recognition tasks. The paper studies the relevance of a new corpus, CaBeRnet, featuring a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language, and therefore to yield better evaluation scores on different evaluation sets and tasks.
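Balancing a corpus across genres, as CaBeRnet does, can be sketched as capping each genre at a fixed share of the training data so that no single genre dominates. The genre names, pool sizes and cap below are illustrative, not CaBeRnet's actual composition:

```python
from collections import Counter

def balance_corpus(docs_by_genre, per_genre):
    """Take at most `per_genre` documents from each genre so no
    single genre dominates the training corpus."""
    balanced = []
    for genre, docs in sorted(docs_by_genre.items()):
        balanced.extend((genre, doc) for doc in docs[:per_genre])
    return balanced

# Illustrative pools: news is over-represented in the raw collection.
pools = {
    "news": [f"news-{i}" for i in range(100)],
    "fiction": [f"fic-{i}" for i in range(30)],
    "oral": [f"oral-{i}" for i in range(20)],
}
corpus = balance_corpus(pools, per_genre=20)
counts = Counter(genre for genre, _ in corpus)
```

The trade-off this sketch makes explicit is the quality-versus-quantity question the abstract raises: capping genres discards raw data in exchange for a more representative distribution.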
CamemBERT: a Tasty French Language Model
Pretrained language models are now ubiquitous in Natural Language Processing.
Despite their success, most available models have either been trained on
English data or on the concatenation of data in multiple languages. This makes
practical use of such models --in all languages except English-- very limited.
In this paper, we investigate the feasibility of training monolingual
Transformer-based language models for other languages, taking French as an
example and evaluating our language models on part-of-speech tagging,
dependency parsing, named entity recognition and natural language inference
tasks. We show that the use of web crawled data is preferable to the use of
Wikipedia data. More surprisingly, we show that a relatively small web crawled
dataset (4GB) leads to results that are as good as those obtained using larger
datasets (130+GB). Our best performing model CamemBERT reaches or improves the
state of the art in all four downstream tasks.
Comment: ACL 2020 long paper. Web site: https://camembert-model.f
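CamemBERT is a RoBERTa-style model, pretrained by masking a random subset of input tokens and training the network to recover them. A toy sketch of that masking step is given below; the 15% rate follows the BERT/RoBERTa convention, and this is not CamemBERT's actual implementation:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace roughly `rate` of the tokens with a mask symbol and
    return the masked sequence plus the positions to recover."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions

tokens = "le fromage français est vraiment délicieux".split()
masked, positions = mask_tokens(tokens)
```

During pretraining, the model's loss is computed only at the masked positions; at fine-tuning time the mask machinery is dropped and the encoder is reused for tagging, parsing, NER or inference heads.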